FALSE VARIABLE SELECTION RATES IN REGRESSION

By Max Grazier G'Sell, Trevor Hastie and Robert Tibshirani

Stanford University 

There has been recent interest in extending the ideas of False Discovery Rates (FDR) to variable
selection in regression settings. Traditionally the FDR in these settings has been defined in terms
of the coefficients of the full regression model. Recent papers have struggled with controlling this
quantity when the predictors are correlated. This paper shows that this full model definition of FDR
suffers from unintuitive and potentially undesirable behavior in the presence of correlated
predictors. We propose a new false selection error criterion, the False Variable Rate (FVR), that
avoids these problems and behaves in a more intuitive manner. We discuss the behavior of this
criterion and how it compares with the traditional FDR, and we present guidelines for determining
which is appropriate in a particular setting. Finally, we present a simple estimation procedure for
FVR in stepwise variable selection. We analyze the performance of this estimator and draw
connections to recent estimators in the literature.

1. Introduction. Since the introduction of False Discovery Rates (FDR) (Benjamini and 
Hochberg, 1995), the idea has had a large impact on error control for many statistical problems and 
has inspired many further statistical developments (e.g. Tusher, Tibshirani and Chu, 2001; Efron, 
2010; Dudoit and Van Der Laan, 2007). More recently, there has been an interest in generalizing 
the ideas from FDR to variable selection in the regression setting. 

Abramovich et al. (2006) introduce the idea of FDR in the regression setting as a criterion for
variable selection, and give results about the asymptotic minimaxity of this method. The results
focus on the case where the variables being considered are orthogonal. Since then, there has been
work extending the idea of FDR in regression to the correlated variable setting. These works include
Benjamini and Gavrilov (2009), which proposes a generalized FDR-based penalty to guide variable
selection; Lin, Foster and Ungar (2011), which proposes a procedure for variable screening in
regression that controls an FDR-related quantity under certain conditions; Meinshausen and Bühlmann
(2010) and Shah and Samworth (2012) on stability selection; and others (e.g. Meinshausen, Meier
and Bühlmann, 2009; Wu, Boos and Stefanski, 2007).

In this paper we consider linear models of the form 

(1.1)    $y_i = \beta_0 + x_i^T \beta + \varepsilon_i, \quad i = 1, \ldots, n,$

with $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}^1$, $\beta \in \mathbb{R}^p$, and $\varepsilon_i$ independent and identically distributed.

For variable selection, we denote the selected set of variables as a subset $A \subseteq \{1, \ldots, p\}$ of the
potential variables. The number of false selections, denoted by $V$ (to be defined carefully in Section
2), is then a property of the set $A$. Similarly, the proportion of false selections, $V / |A|$, is also a
property of the set $A$.

The rate of false selections is the expected value of that proportion of false selections, $\mathbb{E}(V / |A|)$,
where the expectation is taken over realizations of data and conditional on the variable selection 
procedure. As a result, false discovery rates are not a property of a particular selected set A, but of 
the variable selection procedure and the structure of the model. In the following sections, we refer 
to proportions and rates of false selections in this way. 

In the literature, a false selection in the regression setting is usually defined as a selected variable
that has a zero coefficient in the full model (e.g. Lin, Foster and Ungar, 2011; Meinshausen and
Bühlmann, 2010; Meinshausen, Meier and Bühlmann, 2009; Wu, Boos and Stefanski, 2007). That
is, for a set of selected variables $A \subseteq \{1, \ldots, p\}$ and full model (1.1), the proportion of falsely selected
variables is defined as

$\mathrm{FDP} = |\{j \in A : \beta_j = 0\}| / |A|.$

The FDR is the expectation of this quantity for the given procedure. For this paper, we refer to this 
as the full model definition of a false selection and FDR. In contrast to this definition, in Section 
2.2.3 we introduce new quantities, the False Variable Proportion (FVP) and its expectation, the 
False Variable Rate (FVR), which are defined in terms of the projection of the full model onto the 
selected variables A. 
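
To make the distinction between a proportion and a rate concrete, the following sketch estimates a rate by averaging proportions over simulated data sets. It is illustrative only: the selection rule `select_top_k`, the dimensions, the coefficient vector, and the convention that the proportion is 0 for an empty selected set are all assumptions made for this example, not procedures from this paper.

```python
import numpy as np

def false_selection_proportion(selected, beta):
    """V / |A| with V counted by zero full-model coefficients (assumed 0 if A is empty)."""
    selected = np.asarray(selected, dtype=int)
    if selected.size == 0:
        return 0.0
    return float(np.mean(beta[selected] == 0))

def select_top_k(X, y, k):
    """Hypothetical selection procedure: the k predictors with largest |x_j^T y|."""
    scores = np.abs(X.T @ y)
    return np.argsort(scores)[-k:]

rng = np.random.default_rng(0)
n, p, k = 50, 10, 3
beta = np.zeros(p)
beta[:2] = 1.0  # assumed truth: two signal variables, eight nulls

# One proportion per simulated data set; the rate is their average.
proportions = []
for _ in range(500):
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)
    proportions.append(false_selection_proportion(select_top_k(X, y, k), beta))

rate = float(np.mean(proportions))
print(f"estimated rate of false selections: {rate:.3f}")
```

Each simulated data set yields one proportion for its selected set; only the average over data sets is a property of the procedure.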

The FVR quantity we introduce is motivated by practical issues that arise when applying FDR to
large screening problems such as gene expression studies. In these settings, the presence of correlated
predictors can lead univariate FDR methods to select variables that are all capturing the same 
underlying signal. The desire to select unique variables and to detect multivariate effects has led 
to the use of regression variable selection techniques (Broman and Speed, 2002). When the full 
model definition of FDR is applied in these settings, it is difficult to distinguish the significance 
of the correlated predictors, which can inflate their FDR. This has led to an exploration of other 
approaches to defining false selections in the correlated variable setting (Frommlet et al., 2012). 

In this paper we demonstrate shortcomings of existing definitions of false selection rates in 
regression, and propose a new error criterion, called the False Variable Rate (FVR), which we show 
exhibits more desirable behavior. In Section 2, we discuss what constitutes a falsely selected variable,
introduce our new criterion and examine the differences in these definitions when the predictors
are correlated. In Section 3, we present intuition for the differences in behavior of the error criteria,
and provide more concrete examples where our new criterion behaves desirably. Finally, in Section
4, we present a simple estimation algorithm for FVR in stepwise regression. We provide motivation 
for this estimator and examine the regimes in which it breaks down, making connections to broader 
issues in FVR estimation. 

2. Defining a false selection. We are interested in examining the population definition of 
a falsely selected variable in our regression model. We begin with a toy example of our problem 
setting, and then move on to a discussion of different definitions of a false selection.

2.1. Toy Example. This toy example may be helpful for gaining an intuition of the alternative 
false selection definitions, and for understanding the general purpose of these criteria. 

Imagine analyzing gene expression data and trying to understand some biological outcome as
a function of that expression. In our simplified example, illustrated in Figure 1, we observe the
expression of eight genes, $A, B_1, B_2, B_3, C_1, C_2, C_3, D$. Of these, genes $A$ and $B_1$ are biologically
responsible for our outcome of interest, and they have corresponding nonzero coefficients in the full
model that contains all the variables.

Genes:          A*   B1*   B2   B3   C1   C2   C3   D
Selected sets:  {A, B2, C1}     {A, B1, B2}

Fig 1. Illustration of a simplified variable selection example. The eight genes represent variables available in our
data set. Genes $A$ and $B_1$ (green and starred) are biologically relevant to the outcome of interest. The blocks indicate
groups of highly correlated variables. From this data set, we consider two possible selected sets of variables. This paper
addresses the meaning of false selections in sets like these.

However, suppose our data have further structure. Some of the genes occur in a common
biological pathway, leading them to be very strongly correlated. These groups are $B_1, B_2, B_3$ and
$C_1, C_2, C_3$, illustrated by gray outlines in Figure 1. As a result, all the genes from one of these
groups carry nearly the same information about the outcome and are very hard to distinguish
based on experimental data.

The figure illustrates two possible selected subsets of variables from the data set. Our goal is 
to understand how to assess the quality of selected sets like these by determining a meaningful 
sense of a correct or false selection, leading to false selection proportions and rates. As we see in 
the following sections, the presence of correlated predictors allows for different interpretations of a 
false selection. We refer back to this toy example to help convey the scientific implications of those 
definitions and interpretations. 

2.2. Definitions. In these sections, we describe three natural definitions of a false selection, two
of which are common in the literature and the last of which is newly proposed in this paper.

Note: Because this section is focusing on the population definition of a false selection for a set of 
variables, rather than for a particular procedure, we focus on numbers (V) and proportions (FDP) 
of false selections, rather than rates (FDR). 

2.2.1. Marginal Correlations. The simplest definition of a false selection in our model is similar
to the usual univariate approach to screening using marginal correlations. It defines the $j$th variable
to be falsely selected if $\mathrm{Cov}(y, X_j) = 0$. For a selected set of variables $A$, the number of false
selections $V$ and the false discovery proportion FDP are given by

(2.1)    $V = |\{j \in A : \mathrm{Cov}(y, X_j) = 0\}|,$

(2.2)    $\mathrm{FDP} = |\{j \in A : \mathrm{Cov}(y, X_j) = 0\}| / |A|.$

This definition considers a selected variable interesting if that variable captures information 
about the signal on its own, irrespective of any of the other variables in the data set or in the 
selected model. In our gene expression example, any pathway with a gene that is important to the 
outcome will have all of the genes in that pathway considered correct selections, since they will all 
have marginal correlation with the outcome (Figure 2). 

This definition of a false selection is equivalent to the one used in many variable selection screening
procedures, for example Tusher, Tibshirani and Chu (2001) and Efron (2010). We will not focus
much on controlling this false selection rate, since it can be handled with the standard univariate
FDR tools that have been very well discussed in the literature.

Genes:          A*   B1*   B2   B3   C1   C2   C3   D
Selected sets:  {A, B2, C1}     {A, B1, B2}
Marginal:        ✓   ✓   ✗       ✓   ✓   ✓

Fig 2. Here we illustrate the marginal definition of a false selection in the context of our earlier example (Figure 1).
We see that any variable that is marginally correlated with the outcome is considered correct. This includes $B_2$ in both
sets, since it is correlated with $B_1$, which is in turn correlated with the outcome of interest.

2.2.2. Full Model Definition. As mentioned in Section 1, the literature usually defines a
falsely selected variable as one which has a zero coefficient in the full model. In the notation from
Equation (1.1), this means that $V$ and FDP are given by

(2.3)    $V = |\{j \in A : \beta_j = 0\}|,$

(2.4)    $\mathrm{FDP} = |\{j \in A : \beta_j = 0\}| / |A|.$

This definition has been used in several papers, among them Meinshausen and Bühlmann (2010);
Meinshausen, Meier and Bühlmann (2009); Wu, Boos and Stefanski (2007). A modified version
of the FDR appears in Lin, Foster and Ungar (2011), using this definition of the number of false
discoveries $V$.

This definition of a false discovery is natural, particularly in the setting with uncorrelated $X_j$. In
that setting, it actually agrees with the definition in Equations (2.1) and (2.2). When the $X_j$ are correlated,
the meanings of these two definitions differ. The definition in terms of marginal correlations, as we
mentioned, asks if each selected variable captures information about the signal on its own.

Genes:          A*   B1*   B2   B3   C1   C2   C3   D
Selected sets:  {A, B2, C1}     {A, B1, B2}
Full:            ✓   ✗   ✗       ✓   ✓   ✗

Fig 3. Here we illustrate the full model definition of a false selection in the context of Figure 1. Here only variables
$A$ and $B_1$ can be considered correct detections, since they are the only variables with nonzero coefficients in the full
model.

In contrast, the coefficient corresponding to a variable in the full model is only nonzero if that
variable captures information about the signal that is not captured by any other variables in the
full model. The proportion of false selections in this setting therefore corresponds to the fraction
of selected model variables that fail to uniquely capture signal among all the variables in the
considered data set. In our gene expression example, only the gene from an important pathway
that has a nonzero coefficient in the full model will be considered important (in our case $A$ or $B_1$;
see Figure 3). This can be counterintuitive, since any one of the genes from these pathways could
be strongly predictive of the outcome of interest.

Furthermore, in practice it would likely be impossible to determine which of the highly correlated 
variables in an important pathway actually carried the nonzero coefficient in the full model, which 
would lead to all the variables appearing as false selections according to this criterion. 

This is a reasonable choice for some statistical problems and some scientific settings. It has a 
strong connection to the full model p-values in regression, where significance of a coefficient shows 
that a particular variable is significantly correlated with the response after the effects of all the 
other variables have been removed. However, as we discuss further in Section 3, this criterion has 
unintuitive behavior for many of the scientific problems in which variable selection is being applied. 
In the next section, we introduce a new criterion which is more appropriate for those settings. 

2.2.3. A new approach: False Variable Rate. In this section, we propose a new approach for 
defining false selections which lies between these two extremes. Rather than requiring that an 
interesting variable be correlated with the signal in a way that is not explained by any other 
variables in the data set, we instead consider a variable to be an interesting selection if it captures 
signal that has not been explained by any other variable in the selected model. 

For many of the situations where variable selection is applied, this is a more natural view. 
Common variable selection approaches like stepwise or $L_1$ regression attempt to include variables
that capture part of the signal that the other selected variables miss. However, neither of these
methods checks that a variable being included captures signal that excluded variables do not also
capture. For applications like the screening of predictors in biology, where predictors may be strongly 
correlated and the data matrix may not be carefully structured with meaningfully chosen columns, 
this is a more interpretable criterion. We come back to this idea when we contrast the methods 
more carefully in Section 3. 

We now define a criterion with the desired behavior. Rather than looking at the coefficients 
of the full model as in Section 2.2.2, we instead look at the coefficients of the model formed by 
projecting the true model onto the selected variables. This resembles some ideas from Berk et al. 
(2012), where inference is conducted with respect to the selected model even if there may be some 
larger true model. 

For a selected set $A \subseteq \{1, \ldots, p\}$ we have a restricted data matrix $X_A$, formed by the columns
with indices in $A$. We can project the mean $X\beta$ from the full linear model onto this subset of
predictors $X_A$. This gives a projected mean $X_A \beta^{(A)}$ for some $\beta^{(A)} \in \mathbb{R}^{|A|}$. In the event that $\beta^{(A)}$ of
this form is not unique, meaning that $X_A$ is not full rank, we choose $\beta^{(A)}$ to be any of the sparsest
vectors satisfying the projection requirement.
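
When the restricted matrix has full column rank, the projection has a familiar closed form. As a sketch, writing $X_A$ for the restricted data matrix and $\beta^{(A)}$ for the projected coefficient vector, the standard least-squares identity (stated here for convenience, under the full-rank assumption) reads:

```latex
\beta^{(A)}
  = \arg\min_{b \in \mathbb{R}^{|A|}} \left\| X\beta - X_A b \right\|_2^2
  = \left( X_A^T X_A \right)^{-1} X_A^T X \beta .
```

In particular, if $A$ already contains every variable with a nonzero full-model coefficient, then $X\beta = X_A \beta_A$ and the formula returns $\beta^{(A)} = \beta_A$, so the projected and full model coefficients agree on $A$.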

We now define a selected variable to be a false selection if it has a zero coefficient in this vector,
which we write as $\beta^{(A)}$. This means that the number of false selections is just the number of zeros
in this vector of coefficients for the projected mean, giving

(2.5)    $V = |\{j \in A : \beta^{(A)}_j = 0\}|,$

(2.6)    $\mathrm{FVP} = |\{j \in A : \beta^{(A)}_j = 0\}| / |A|,$

where we refer to the proportion of falsely selected variables by this definition as the false variable
proportion (FVP) to differentiate it from the usual regression definition (FDP) in Section 2.2.2.
Similarly, the expectation of this proportion is referred to as the False Variable Rate, or FVR.

This quantity has the desired interpretation. A variable is considered interesting if it is correlated
with the signal $y$ after the effects of the other selected variables have been removed. If two variables
are capturing the same piece of signal, including either one of them will be a good selection, but
including a second one will not be adding any new information, and will thus be a false selection.

Genes:          A*   B1*   B2   B3   C1   C2   C3   D
Selected sets:  {A, B2, C1}     {A, B1, B2}
Projected:       ✓   ✓   ✗       ✓   ✓   ✗

Fig 4. Here we illustrate our new projected model definition of a false selection in the context of Figure 1. We see
that variables are now correct selections if they are capturing unique signal among the selected variables. Thus $B_2$
is correctly selected in the first set. However, $B_2$ is considered a false selection in the second set because it adds no
information beyond $B_1$.

In our gene expression example, this means that one gene selected from a given influential
pathway will be considered a correct selection and any further selections from that pathway will
be incorrect. This is illustrated in Figure 4, where we see that the selection of $B_2$ in the first set
is considered correct, because it is adding information to the selected set. In the second set, $B_2$ is
considered incorrect, because $B_1$ is already contributing the same information about the outcome.
This seems like a natural definition in this setting. In Section 3, we elaborate on the differences
between the criteria in detail.

Note: Some care needs to be taken for models with random $X$, where we want to rule out spurious
correlations between the random predictors. The number of correct selections in $A$ is generalized
to be the size of the smallest subset $B \subseteq A$ such that, when conditioning on $B$, $y$ is conditionally
uncorrelated with the rest of the selected variables, $A \setminus B$. This has a nice form for Gaussian
graphical models, discussed in Section 3.2.

2.3. Summary and comparison. Before moving on to discussion of implications and behavior of 
the different error criteria, we briefly summarize the three definitions that we have discussed, and 
their simple interpretation. Their implications for the gene expression example of Section 2.1 are 
shown in Figure 5. 

Genes:          A*   B1*   B2   B3   C1   C2   C3   D
Selected sets:  {A, B2, C1}     {A, B1, B2}
Marginal:        ✓   ✓   ✗       ✓   ✓   ✓
Full:            ✓   ✗   ✗       ✓   ✓   ✗
Projected:       ✓   ✓   ✗       ✓   ✓   ✗

Fig 5. A summary of the implications of the marginal, full, and projected model definitions of a false selection on our
example from Figure 1. We see that there are cases in which each of the definitions disagrees with the others.

Marginal view (Section 2.2.1). A variable $X_j$ is considered a false selection if $\mathrm{Cov}(y, X_j) = 0$. This
implies that a variable is interesting if it is correlated with the signal, without regard to any of the
other variables in the data set.

Full Model view (Section 2.2.2). A variable $X_j$ is considered a false selection if $\beta_j = 0$ in the full
model. This implies that a variable is interesting if it is correlated with the signal after conditioning
on all the other variables in the entire data set.

Projected Model view (Section 2.2.3). A variable $X_j$ is considered a false selection if $\beta^{(A)}_j = 0$ in
the projected model onto the selected variables $X_A$. This implies that a variable is an interesting
selection if it is correlated with the signal after conditioning on all the other variables in the set of
selected variables. This definition is used for our proposed False Variable Rate (FVR) criterion.
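
The three views can be compared numerically on the toy example of Section 2.1. The sketch below works at the population level and assumes unit-variance genes, a within-pathway correlation of 0.9, uncorrelated pathways, and unit coefficients on $A$ and $B_1$. These numbers, and the helper name `false_proportions`, are illustrative choices rather than values from the example.

```python
import numpy as np

genes = ["A", "B1", "B2", "B3", "C1", "C2", "C3", "D"]
rho = 0.9  # assumed within-pathway correlation

# Population covariance of the genes: two correlated blocks, B and C.
cov = np.eye(len(genes))
for block in ([1, 2, 3], [4, 5, 6]):
    for i in block:
        for j in block:
            if i != j:
                cov[i, j] = rho

# Assumed full model: only A and B1 have nonzero coefficients.
beta = np.zeros(len(genes))
beta[[0, 1]] = 1.0

def false_proportions(selected, tol=1e-10):
    """Proportion of false selections under the marginal, full and projected views."""
    selected = list(selected)
    cov_with_y = cov @ beta  # Cov(X_j, y) for every gene j
    marginal = np.abs(cov_with_y[selected]) <= tol
    full = np.abs(beta[selected]) <= tol
    # Population projection of the full model onto the selected genes.
    beta_a = np.linalg.solve(cov[np.ix_(selected, selected)], cov_with_y[selected])
    projected = np.abs(beta_a) <= tol
    return {"marginal": float(marginal.mean()),
            "full": float(full.mean()),
            "projected": float(projected.mean())}

set_1 = false_proportions([0, 2, 4])  # {A, B2, C1}
set_2 = false_proportions([0, 1, 2])  # {A, B1, B2}
print(set_1)
print(set_2)
```

Under these assumptions the first set gives proportions 1/3, 2/3 and 1/3 for the marginal, full and projected views, and the second gives 0, 1/3 and 1/3, matching the discussion in Sections 2.2.1 to 2.2.3.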

In the next section, we will see the impact of the differences in these definitions on the behaviors 
of the error criteria. 

3. Contrasting the False Selection Criteria. In this section, we will discuss the behavior 
of the usual full model definition and our new projected model definition of false variable selection 
rates. We will see that, though the full model definition is reasonable in some settings, it leads 
to non-intuitive behavior in common variable selection settings. We will show that our proposed 
approach has intuitive and desirable behavior in those cases. 

3.1. A simple example. To understand the differences in behavior between these definitions, we 
begin with the following simple example, represented in Figure 6.




[Panels A, B, C and D]

Fig 6. Representation of four possible variables. The projection of the true (noiseless) model into the space spanned by
the variables is indicated as the green circle and label $X\beta$. Variable 3 is perfectly correlated with the signal, variables
1 and 2 are correlated with the signal, and variable 4 is orthogonal to the signal. The red arrows indicate the variables
that have been selected.

                   Figure D
Marginal FDP       0
Full Model FDP     2/3
Proposed FVP       2/3

Table 1
Resulting false selection proportions from applying each of the three definitions of Section 2.2 to the scenarios in
Figure 6. The criteria disagree on values in Figures B and D because the definitions give different value to correlated
variables that explain the same part of the signal.

We consider four different simple cases here. Note that we are examining proportions of false 
selections for a selected set, rather than rates of false selections for a procedure, so we will describe 
FDP/FVP quantities, rather than FDR/FVR rates. The false selection proportions for each scenario 
and each definition are shown in Table 1. 

All three definitions agree on scenarios A and C, since they deal only with variables that are
perfectly correlated with or orthogonal to the true signal. The cases B and D are more interesting. In case
B, variables 1 and 2 are correct selections according to the projected definition, since they capture
information about the signal that is not included in any other selected variable. These variables
are both considered false selections by the full model definition, since the data set (though not the
selected set) contains variable 3, which captures all the information in variables 1 and 2.

In scenario D, the full and projected model definitions now agree, since variable 3 is included in 
the selected set, rendering variables 1 and 2 uninformative. The marginal definition continues to 
consider all three variables correct selections, since it is not concerned with uniqueness. 

We see that the definitions may all disagree, depending on the structure of the data set and 
the selected model. In general, the full and projected model approaches differ when variables are 
selected that are correlated with the signal, but would have their correlation explained away by an 
unselected variable in the data set. 
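
Scenarios B and D can be reproduced with a small fixed design. The sketch below is one possible instance, not the configuration drawn in Figure 6: the four column vectors are hypothetical, the marginal check uses a plain inner product with the noiseless mean in place of a covariance, and `proportions` is an illustrative helper name.

```python
import numpy as np

# Hypothetical design in R^4 (columns are variables 1 to 4):
# variable 3 equals the signal, 1 and 2 overlap with it, 4 is orthogonal to it.
X = np.array([[1.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
beta = np.array([0.0, 0.0, 1.0, 0.0])  # full model: only variable 3 matters

def projected_coefficients(X, mean, selected):
    """Least-squares coefficients of the noiseless mean on the selected columns."""
    coef, *_ = np.linalg.lstsq(X[:, selected], mean, rcond=None)
    return coef

def proportions(X, beta, selected, tol=1e-10):
    """Return (marginal FDP, full model FDP, projected FVP) for a selected set."""
    mean = X @ beta
    marginal = np.abs(X[:, selected].T @ mean) <= tol
    full = np.abs(beta[selected]) <= tol
    projected = np.abs(projected_coefficients(X, mean, selected)) <= tol
    return tuple(float(flags.mean()) for flags in (marginal, full, projected))

scenario_b = proportions(X, beta, [0, 1])     # variables 1 and 2 selected
scenario_d = proportions(X, beta, [0, 1, 2])  # variables 1, 2 and 3 selected
print("B:", scenario_b)
print("D:", scenario_d)
```

In this instance scenario B gives 0, 1 and 0 for the marginal, full and projected proportions, and scenario D gives 0, 2/3 and 2/3: the full and projected views disagree only while variable 3 is left out of the selected set.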

3.2. Graphical Model View. The interpretation of and differences between these false selection
definitions are particularly clear in a Gaussian graphical model setting. Suppose that the variables
$X_j$ and the response $Y$ have a joint Gaussian distribution, with distributions $X \sim N(0, \Sigma)$ and
$Y \sim N(X^T \beta, \sigma^2)$. We represent the dependence structure of the variables by the usual dependence
graph, as illustrated in Figure 7a. Two variables are connected here if they have nonzero partial
correlation after conditioning on the other variables.




Fig 7. Example of a dependence graph. Plot (a) shows the dependence graph for the full model. Plot (b) shows the
induced dependence graph for the marginal distribution of the selected variables $A = \{2, 3, 5, 7\}$ and $Y$.

If we select a subset of the variables $A$, there is an induced dependence graph for the marginal
distribution of $y \cup A$. The structure follows from the usual manipulations of Gaussian covariance
matrices (see Appendix A.1). This marginal graph corresponds to the dependence structure of the
variables in the projected model we discussed earlier.

Each of our definitions of a false selection has an interpretation in terms of these graphical
models. Suppose we select variables $A = \{2, 3, 5, 7\}$ in the example shown in Figure 7. In the
marginal definition, a variable in $A$ is a false selection if it does not have a path to $y$ in the full
graph (a), since such a path would correspond to a marginal correlation with $y$. In this case, all
the variables are connected to $y$ except variable 7, so there would be one false selection by that
definition, and FDP = 1/4.

In the full model definition, a variable in $A$ is a false selection if it is not directly connected to $y$
in the full graph (a), since such links correspond to nonzero partial correlations. By this definition,
variables 2, 5 and 7 would be false selections, and FDP = 3/4 for this approach.

In our projected model definition, a variable in $A$ is a false selection if it is not directly connected
to $y$ in the graph (b) induced on the selected variables. The links in that induced graph correspond
to correlations that cannot be explained by other selected variables. In the example, variables 2 and
3 are directly connected to $y$ in the induced graph, and variables 5 and 7 are not, so FVP = 1/2.

From these graphs, we gain an intuition for the behavior of the full and projected definitions. By
switching from the full model to the projected model, represented by the induced graph, unselected
variables that were important in the full model can induce importance in correlated variables. This
lets selected variables that are carrying that missing information be considered correct selections
when they are included.
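
The induced graph can be computed directly from a covariance matrix, since a zero entry of the inverse covariance (precision) matrix corresponds to a zero partial correlation. The sketch below uses a small hypothetical model, not the graph of Figure 7: $X_2 \sim N(0,1)$, $X_1 = 0.8 X_2 + e$, $X_3$ independent, and $Y = X_2 + \text{noise}$. The function name `neighbors_of_y` is an illustrative choice.

```python
import numpy as np

def neighbors_of_y(cov, keep, y_index, tol=1e-8):
    """Variables in `keep` directly linked to y in the graph induced on keep + y."""
    nodes = list(keep) + [y_index]
    # Marginalizing a Gaussian keeps the covariance submatrix; its inverse
    # gives the partial-correlation structure of the induced graph.
    precision = np.linalg.inv(cov[np.ix_(nodes, nodes)])
    y_row = precision[-1, :-1]
    return [v for v, entry in zip(keep, y_row) if abs(entry) > tol]

# Covariance of (X1, X2, X3, Y) under the hypothetical model above,
# with Var(e) = 0.36 and unit noise variance in Y.
cov = np.array([[1.0, 0.8, 0.0, 0.8],
                [0.8, 1.0, 0.0, 1.0],
                [0.0, 0.0, 1.0, 0.0],
                [0.8, 1.0, 0.0, 2.0]])
y_index = 3
selected = [0, 2]  # select X1 and X3, leaving X2 out

full_links = neighbors_of_y(cov, [0, 1, 2], y_index)   # links to y in the full graph
induced_links = neighbors_of_y(cov, selected, y_index)  # links to y in the induced graph

fdp_full = float(np.mean([v not in full_links for v in selected]))
fvp = float(np.mean([v not in induced_links for v in selected]))
print(full_links, induced_links, fdp_full, fvp)
```

Here only $X_2$ is linked to $Y$ in the full graph, so both selected variables are false under the full model definition. Once $X_2$ is marginalized out, $X_1$ inherits a direct link to $Y$ and counts as a correct selection under the projected definition, while $X_3$ remains false.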

3.3. Implications for the Choice of False Selection Criteria. So far in this section, we have
described the differences between the false selection criteria we have presented. We now
discuss the practical implications of these differences in statistical problems. We show that in
many common settings the usual full model FDP/FDR will give misleading results, and that our
proposed FVP/FVR definition will have the intuitive behavior we desire.

3.3.1. Highly correlated predictors. Suppose our dataset contains highly correlated predictors. 
This commonly occurs in biological data, where there might be an underlying factor driving several 
variables. In our gene expression example, several genes in our data set could be from the same 
biological pathway, leading them to be over-expressed or under-expressed together. 

Now imagine that one of these pathways is biologically relevant to our outcome of interest. This 
could lead to several variables which are strongly correlated with the outcome, but are also highly 
correlated with one another. In the setup we just described, it may not be possible to distinguish 
which of these variables has the "true" relationship with the signal variable. In that case, what 
behavior do we desire from an error criterion and how do the criteria we have proposed behave? 

For the usual univariate definition of a false discovery, all of these correlated variables will be
considered correct detections, since they carry some information about the outcome variable. However, this
is not likely to be the behavior we are interested in, since the use of a multivariate model in the
first place expresses interest in capturing "unique" signal of some kind.

In the full model definition of a false discovery, each of these variables is "interesting" only if it
captures unique signal among all the variables being considered. Because of the high collinearity, it
will be impossible to distinguish a unique signal that is captured by any one of the variables. As a
result, they would all appear as false selections in practice.

Furthermore, if the selection procedure were guided by an attempt to control the full model FDR, 
it would discourage the selection of any of these variables because of the other highly correlated 
variables in the data set. As a result, it would be likely that such a procedure would fail to select 
any of these variables, even though any one of the variables could carry most of the predictive 
power of the data set. 

Contrast these behaviors with that of the proposed FVR criterion. As we have discussed, a 
variable is considered interesting by FVR if it captures unique signal among all the selected variables. 
Consider a selected set that contains only one of the highly correlated and predictive variables. 
While the full model definition would consider it a false detection, the FVR definition considers it 
a correct selection because it is adding unique explanatory power to the selected set of variables. 
Furthermore, if several of these correlated explanatory variables are added, only one of them (it 
does not matter which) will be considered a true detection by FVR and the others will be considered 
false. This makes intuitive sense, as the group of correlated variables should be able to contribute 
"one variable" of explanatory power. 

We believe that in many settings, this is the interpretation that is being sought for a false
discovery rate. It is helpful to have a criterion that agrees with this interpretation, and it is convenient
that defining the criteria in terms of the projected model gives this interpretation.

3.3.2. Stability to the set of considered variables. Another non-intuitive property of the full
model FDR concerns its stability to changes in the data set. Suppose that our data set contains
just two variables (call them 1 and 2), each of which captures some of the signal in the dependent
variable. This scenario is shown in Figure 8(a). Imagine that a variable selection procedure selects
both of these variables. Under the full model definition of FDR, both of these selections are
considered correct. Now suppose that the data set had included another variable 3 that is very strongly
correlated with the signal, so that variables 1 and 2 are uncorrelated with $y$ conditional on variable
3 (Figure 8(b)). Then under the full model definition of FDR, both variables 1 and 2 would now be
considered false detections, even if variable 3 was not selected! The FDP of the selected set changes
from 0 to 1, without any change to the variables included in that selected set.

Fig 8. Simplified example demonstrating full model FDR sensitivity to excluded variables. In scenario (a), both 1 and 
2 would be considered correct selections. In scenario (b), both 1 and 2 are considered false selections, whether or not 
variable 3 is selected. 




This emphasizes that the full model FDP is not a property of a selected set, but of both a
selected set and of the full data set being considered. The FDP is highly sensitive to changes in
the full set of variables being considered. The implications of this are worrisome in settings like gene
screening, where the variables being measured may be a matter of convenience and of the existing
microarrays, rather than careful design for a particular experiment. Meaningful interpretation of
the full model FDR depends very heavily on an understanding of the entire data sample that was
collected and analyzed.
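
This sensitivity can be checked with a short population-level calculation. The sketch below assumes a hypothetical model in the spirit of Figure 8: $X_3 \sim N(0,1)$, $X_1 = 0.7 X_3 + e_1$, $X_2 = 0.7 X_3 + e_2$ with independent errors of variance 0.51, and $y = X_3 + \text{noise}$. The numbers and function names are illustrative assumptions.

```python
import numpy as np

# Population covariances under the hypothetical model above (order: X1, X2, X3).
cov_xx = np.array([[1.00, 0.49, 0.70],
                   [0.49, 1.00, 0.70],
                   [0.70, 0.70, 1.00]])
cov_xy = np.array([0.70, 0.70, 1.00])
selected = [0, 1]  # variables 1 and 2 are selected in both scenarios
tol = 1e-10

def population_coefficients(variables):
    """Population regression coefficients of y on the given variables."""
    idx = np.ix_(variables, variables)
    return np.linalg.solve(cov_xx[idx], cov_xy[variables])

def full_model_fdp(dataset):
    """FDP of the selected set when the data set holds the variables in `dataset`."""
    coef = dict(zip(dataset, population_coefficients(dataset)))
    return float(np.mean([abs(coef[j]) <= tol for j in selected]))

def projected_fvp():
    """FVP of the selected set; it only involves the selected variables."""
    return float(np.mean(np.abs(population_coefficients(selected)) <= tol))

fdp_without_3 = full_model_fdp([0, 1])    # scenario (a): variable 3 not measured
fdp_with_3 = full_model_fdp([0, 1, 2])    # scenario (b): variable 3 measured, not selected
fvp = projected_fvp()
print(fdp_without_3, fdp_with_3, fvp)
```

Adding variable 3 to the data set moves the full model FDP of the same selected set from 0 to 1, while the FVP stays at 0 because the projected coefficients depend only on the selected variables and the response.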

3.3.3. Summary. In this section we described practical examples of how the full model definition 
of FDR clashes with the intuitive interpretation of a false discovery rate when the predictors are 
correlated. We saw that when several correlated variables are capturing essentially the same signal, 
the FDP can be unstable and, in the presence of noise, the FDR for the selected variables can be 
high even if the variables are capturing a strongly predictive signal. In contrast, the FVP and FVR 
behave intuitively in the presence of correlated explanatory variables such as these, counting the 
first such variable to be selected as an interesting selection, and any subsequent such variables 
as uninformative. 

In addition, we showed that the FDP and FDR are highly sensitive to all the variables included 
in the data set, even those variables that are not selected. As a result, the usefulness of the full 
model FDR depends on a careful understanding of all the variables being considered in the data 
set. The FVR avoids some of this trouble, since it is not affected by unselected variables in the 
data set. 

It is worth noting that there are cases where the full model FDR is the correct definition to 
use. First, if the predictors are uncorrelated, all three definitions of a false selection are equivalent. 
Second, there are scientific setups where the interpretations described above are the desired ones. 
Suppose one has carefully selected all the variables being considered, and that a guarantee is desired 
that any selected variable is actually uniquely related to the signal among all the variables being 
considered. Then one might desire to only select a variable if its relation with the outcome y stands 
out among all the variables. If two variables cannot be distinguished, the scientist might not wish 
to select either until enough data can be gathered to distinguish between them. In that setting, 
the FDR definition will have the proper interpretation. However, we feel that in the majority of 
experimental setups, particularly experiments where the focus is on screening, the FVR definition 
is more in line with research goals and intuitive interpretations. 

We conclude this section with a simple simulation. This simulation demonstrates the differences 
in behavior between FDR and FVR when evaluating stepwise regression on a set of correlated 
variables. The simulation is constructed to fit the setting of Section 3.3.1, where predictive variables 
are present but appear in strongly correlated groups. 

3.4. Simulation Example. To see the difference in behavior between the full model definition 
and our projected model definition, consider the following simple setup, illustrated in Figure 9. 




Fig 9. Illustration of the simulation setup for Section 3.4. Nonzero elements of $\Sigma$ and $\beta$ are shown by the shaded 
rectangles, and the graph illustrates the corresponding joint dependence structure of X and y. 

We create several blocks of variables, each of which is highly correlated internally. For a subset of 
the blocks, we select one variable within the block to have a nonzero coefficient in $\beta$. The structures 
of the resulting matrices are shown in Figure 9, along with the corresponding dependence graph. 

We choose this setup because we expect it to demonstrate a strong difference between the meth- 
ods. The correlation within the blocks is strong enough that the selection method will be unable 
to distinguish the true signal variable within each correlated block, but the signal associated with 
the nonzero coefficients is large enough to be detected at the group level. As a result, selection 
methods should be able to pick those blocks that have signal, but choosing the "correct" variable 
within each block should be nearly random. By the usual full model definition of a false selection, 
these selections will be counted as false. By our new definition, it is recognized that any variable 
within that block carries essentially the same signal, so the first selection within each block will be 
considered correct. 

The plots of the true population FDR and FVR are shown in Figure 10. This particular simulation 
has 20 blocks of two variables each. Ten of those blocks are selected to have a nonzero coefficient 
in $\beta$ for one of the variables. The correlation within each block is 0.95, the additional noise on y is 
0.8, and the total number of observations is n = 50. 
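
The blocked design described above could be generated along the following lines. This is a sketch under stated assumptions, not the authors' simulation code: the noise level 0.8 is interpreted as a variance, the nonzero coefficients are set to 1, and the signal variable is taken to be the first variable of each of the first few blocks. The function name `make_block_design` is hypothetical.

```python
import numpy as np

def make_block_design(n, n_blocks, block_size, n_signal, rho, noise_var, rng):
    """Draw (X, y, beta) with equicorrelated blocks and one signal variable per signal block."""
    p = n_blocks * block_size
    # A shared block factor plus idiosyncratic noise gives unit variance and
    # within-block correlation rho; variables in different blocks are independent.
    shared = rng.standard_normal((n, n_blocks))
    own = rng.standard_normal((n, p))
    X = np.sqrt(rho) * np.repeat(shared, block_size, axis=1) + np.sqrt(1.0 - rho) * own
    beta = np.zeros(p)
    # Assumption: the signal sits on the first variable of each of the first n_signal blocks.
    beta[np.arange(n_signal) * block_size] = 1.0
    # Assumption: noise_var is the variance of the additive noise on y.
    y = X @ beta + np.sqrt(noise_var) * rng.standard_normal(n)
    return X, y, beta

rng = np.random.default_rng(0)
X, y, beta = make_block_design(n=50, n_blocks=20, block_size=2, n_signal=10,
                               rho=0.95, noise_var=0.8, rng=rng)
print(X.shape, int(beta.sum()))
```

With a within-block correlation this high and only 50 observations, the two variables of a signal block are nearly interchangeable as predictors, which is what makes the choice within a block close to random.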

We see that the usual FDR criterion finds a false discovery rate of about 40% for the first 10 
selections, while our new criterion gives an FVR of nearly zero for the first 10 selections. These ten 
selections correspond to correctly selecting the ten blocks with signal, though not necessarily the 
variable within those blocks with nonzero $\beta$ coefficient. 

[Figure 10 image: "FVR and FDR Population Values"]

Fig 10. True full model FDR (black) and projected model FVR (red) for the simulation in Section 3.4. Stepwise 
selection is carried out on a simulation with 20 pairs of correlated variables, 10 of which contain one important signal 
variable. We see that FVR considers either variable of an important pair correct, while FDR does not. Monte Carlo 
standard errors are shown. 

While there are situations where both criteria could make sense, we believe that in many settings, 
the value reported here for FVR is more in line with people's interpretations and goals in these 
settings. The first ten variable selections were highly predictive about y, and it is often reasonable 
to have a sense of false discovery rate that is in line with this fact. 

4. Estimation of False Variable Rates. In this section, we discuss estimation of FVR. 
Classically, estimation of FDR in the regression setting has been quite difficult when the variables 
are correlated (e.g. Lin, Foster and Ungar, 2011). We expect the FVR to be easier to control, since it 
is more closely related to traditional variable selection procedures. Nevertheless, more development 
of good estimation and control procedures is needed. 

In this section, we will construct a very simple, illustrative estimate of FVR for stepwise 
regression. This is not intended to be the ideal estimator of FVR; it should instead be viewed as 
an illustration of FVR and how one might approach its estimation. Through simulation, we will 
demonstrate that this simple estimator works reasonably well, even with correlated predictors, 
providing further evidence that FVR may be an easier target for control. 

We will also examine the assumptions behind this estimation scheme and the regimes in which 
this simple estimation procedure breaks down. These weaknesses are instructive, as our procedure 
shares those limitations with several recent methods for controlling false selections in regression. 
We hope that a better understanding of this simple estimator can inform the construction of future 
estimators. 

4.1. Motivation for the Estimator. The motivation for this algorithm comes from the idea that 
at each step of stepwise regression, the procedure admits the variable that appears to capture 
the most signal once the effects of the other currently selected variables have been removed. This 
approach is in the same spirit as the FVR criterion we have been discussing. 

At each step, we can imagine the hypothesis that the new model is a true improvement over the 
old. This is the usual statistical hypothesis comparing two nested models, where we test whether 
the coefficient of the new variable is zero in the larger of the two models. We will neglect for a 
moment that the selection of the variable would make the usual tests for this hypothesis invalid. 

If we look at the number of these incremental hypotheses that are "null," this seems similar 
to the number of false selections as defined in the FVR. While this statement is not necessarily 
true in reality, there is a relationship between these two quantities. We will make this relationship 
more explicit later in this section and justify it more carefully. Using this relationship, our simple 
algorithm will amount to estimating the number of null incremental hypotheses that were traversed 
in arriving at our selected model. 

4.2. Algorithm. Before delving into a more careful justification of our algorithm, we present it 
in its complete form here. 

To avoid issues with inference after selection, this algorithm relies on random splits of the data 
to separate the selection and inference stages. The following algorithm describes the action for one 
random split of the data. 

1. Split the data randomly into two pieces, call them $X^{(1)}$ and $X^{(2)}$. For convenience we use 
even splits, though this is not necessary. 

2. On $X^{(1)}$, fit a stepwise selection path. This gives an ordering $\mathcal{P}$ to all $p$ of our variables, by 
the order in which they are selected. Notationally, we define $\mathcal{P}_j$ to be the set of the first $j$ 
variables in the ordering. 

3. Define the incremental hypotheses $H_j^{(\mathcal{P})}$ as follows. Let $\xi_j$ be the variable added at the $j$th 
step in $\mathcal{P}$. Let $\beta^{(\mathcal{P}_j)}$ be the coefficients of the model projected on $X_{\mathcal{P}_j}$, the $j$ variables selected 
by step $j$. Then 

$$H_j^{(\mathcal{P})} : \beta^{(\mathcal{P}_j)}_{\xi_j} = 0,$$

which is just the hypothesis that the $j$th addition was not a useful one (at the point that it 
was added). 

For each hypothesis $H_j^{(\mathcal{P})}$, we can obtain a p-value $p_j$ through the usual F or t test of the 
nested models, using the data from $X^{(2)}$. This inference is valid because our variables were 
not selected on $X^{(2)}$. Nonparametric tests, like those based on permutations or the bootstrap, 
can also be used here to avoid distributional assumptions. 

4. For each model size $k$, we now estimate the number of null hypotheses in our set of hypotheses 
$\{H_j^{(\mathcal{P})};\ j = 1, \ldots, k\}$. We use a threshold estimator as in Storey (2002), giving the estimator 

$$\hat{V}_k(\lambda) = \frac{\#\{p_j > \lambda;\ j \le k\}}{1 - \lambda}$$

for any threshold $\lambda \in (0, 1)$. We will show that $\hat{V}_k(\lambda)$ bounds the number of null hypotheses in 
$\{H_j^{(\mathcal{P})};\ j \le k\}$. 

5. The estimate from this split for the FVR for model size $k$ is $\hat{V}_k(\lambda)/k$. 

We average these estimates, $\hat{V}_k(\lambda)/k$, over many splits of X to obtain a final estimated FVR for 
each selected model size. 

In the remainder of this section, we will provide justification for this algorithm and present 
simulation results. 
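
The steps of Section 4.2 can be sketched in code as follows. This is an illustrative sketch under several assumptions that are not in the paper: the data are treated as centered (no intercept is fitted), forward selection greedily adds the variable that most reduces the residual sum of squares, and the nested-model t test is replaced by a large-sample normal approximation so that only NumPy and the standard library are needed. The function names `forward_stepwise_order`, `incremental_pvalues` and `estimate_fvr` are hypothetical.

```python
import math
import numpy as np

def forward_stepwise_order(X, y, max_steps):
    """Greedy forward selection: each step adds the variable that most reduces the RSS."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(max_steps):
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = float(np.sum((y - X[:, cols] @ coef) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

def incremental_pvalues(X, y, order):
    """Two-sided p-value for the newest coefficient in each nested model along `order`."""
    n = X.shape[0]
    pvals = []
    for j in range(1, len(order) + 1):
        Xj = X[:, order[:j]]
        coef, *_ = np.linalg.lstsq(Xj, y, rcond=None)
        resid = y - Xj @ coef
        s2 = float(resid @ resid) / (n - j)
        se = math.sqrt(s2 * np.linalg.inv(Xj.T @ Xj)[-1, -1])
        # Assumption: normal approximation to the t test (reasonable when n - j is large).
        pvals.append(math.erfc(abs(coef[-1] / se) / math.sqrt(2.0)))
    return np.array(pvals)

def estimate_fvr(X, y, max_steps, lam=0.5, n_splits=50, seed=0):
    """Average over random even splits of V_hat_k(lam) / k for k = 1, ..., max_steps."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    total = np.zeros(max_steps)
    for _ in range(n_splits):
        perm = rng.permutation(n)
        half1, half2 = perm[: n // 2], perm[n // 2:]
        order = forward_stepwise_order(X[half1], y[half1], max_steps)  # selection half
        pvals = incremental_pvalues(X[half2], y[half2], order)         # inference half
        v_hat = np.cumsum(pvals > lam) / (1.0 - lam)                   # threshold estimator
        total += v_hat / np.arange(1, max_steps + 1)
    return total / n_splits

# Toy usage (hypothetical data): three strong signal variables among ten independent predictors.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = 3.0 * X[:, :3].sum(axis=1) + rng.standard_normal(200)
fvr_hat = estimate_fvr(X, y, max_steps=8, n_splits=20)
print(np.round(fvr_hat, 2))
```

Note that the estimate is not clipped to [0, 1] in this sketch; the division by $1 - \lambda$ means individual values can exceed 1.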

4.3. Justification. In this section we will provide justification for the estimation procedure pre- 
sented in Section 4.2. Due to the length of some of these explanations, some of the details have 
been moved to Appendix A.3 and summaries are presented here. 

There are three pieces of this algorithm that require justification. 

1. That the hypotheses $H_j^{(\mathcal{P})}$ resulting from the selected path $\mathcal{P}$ are appropriate hypotheses to 
be looking at, in the sense that the number of null hypotheses in $\{H_j^{(\mathcal{P})}\}$ should correspond 
to the number of false selections appearing in the FVR definition. 

2. That we are estimating the number of true null hypotheses in the set $\{H_j^{(\mathcal{P})}\}$ in an appropriate 
way. 

3. That splitting the data gives reasonable estimates of the quantity of interest. 

We will address each of these points in the following subsections. 

4.3.1. Appropriateness of $H_j^{(\mathcal{P})}$. Here we argue that the hypotheses $\{H_j^{(\mathcal{P})}\}$ corresponding to 
the steps of the selected path are reasonable hypotheses to consider, in the sense that the number 
of true null hypotheses in $\{H_j^{(\mathcal{P})}\}$ should be a good estimate of the number of variables with zero 
coefficients in the final projected model. In cases where the variables are correlated, there is no 
reason that this should be true for the incremental hypotheses corresponding to a general ordering 
of the variables. 

The details of this argument can be found in Appendix A.3. The general idea is that there exist 
particular orderings of the variables for which the number of null incremental hypotheses is exactly 
the number of zero coefficients in the projected model. Furthermore, the ordering produced by 
stepwise selection is not too far from these ideal orderings, so the resulting estimate is not badly 
biased. 

This dependence on the ordering of the variables has implications for FVR and FDR estimation. 
One is that the estimation method of Section 4.2 can only be extended to selection methods that 
provide a reasonable ordering of the variables. Similarly, one should be wary of potential bias in 
other methods that estimate FVR or FDR based on the incremental hypotheses along a variable 
ordering, particularly those that rely on random or arbitrary orderings. 

4.3.2. Justification of the threshold estimator. The next piece of the algorithm uses a threshold 
estimator, as in Storey (2002), to estimate the number of true null hypotheses in $\{H_j^{(\mathcal{P})}\}$ based on 
the p-values $p_j$. 

We can show that the expectation of the threshold estimator $\hat{V}_k(\lambda)$ is an upper bound on the 
number of true null hypotheses in $\{H_j^{(\mathcal{P})};\ j \le k\}$: 

$$E\,\hat{V}_k(\lambda) = \frac{1}{1-\lambda} \sum_{j=1}^{k} E\, I_{\{p_j > \lambda\}} \ \ge\ \frac{1}{1-\lambda} \sum_{\substack{j \le k \\ H_j^{(\mathcal{P})}\ \text{true null}}} E\, I_{\{p_j > \lambda\}} = \#\{j \le k : H_j^{(\mathcal{P})}\ \text{true null}\}.$$

This bound depends only on the p-values from the true null hypotheses, so the bound will be 
loose when there is a large contribution from the false null hypotheses. This will happen when the 
$p_j$ corresponding to the false nulls have a significant probability of exceeding $\lambda$, or when there are 
a large number of false null hypotheses in the selected set. This is demonstrated in the simulation 
in Figure 13. 

This also suggests a bias-variance trade-off when selecting $\lambda$, as a large $\lambda$ will give a smaller bias 
but a larger variance. Appendix A.2 mentions a bootstrap approach to calibrating $\lambda$, like that of 
Storey (2002). 
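
The bound and the trade-off can be made concrete with a small calculation. The sketch below is an illustration only and rests on assumptions that are not in the paper: the p-values are independent, null p-values are uniform, and non-null p-values follow a Beta(1, a) distribution so that a non-null p-value exceeds $\lambda$ with probability $(1-\lambda)^a$. The function name `threshold_estimator_moments` and the parameter `alt_shape` are hypothetical.

```python
def threshold_estimator_moments(n_null, n_alt, lam, alt_shape=4.0):
    """Mean and variance of V_hat(lam) = #{p_j > lam} / (1 - lam).

    Assumes independent p-values, Uniform(0, 1) under the null and
    (hypothetically) Beta(1, alt_shape) for the non-null hypotheses.
    """
    q_null = 1.0 - lam                 # P(p > lam) for a true null
    q_alt = (1.0 - lam) ** alt_shape   # P(p > lam) for a false null
    mean = (n_null * q_null + n_alt * q_alt) / (1.0 - lam)
    var = (n_null * q_null * (1.0 - q_null)
           + n_alt * q_alt * (1.0 - q_alt)) / (1.0 - lam) ** 2
    return mean, var

for lam in (0.5, 0.9):
    mean, var = threshold_estimator_moments(n_null=10, n_alt=10, lam=lam)
    print(lam, round(mean, 4), round(var, 4))
```

With 10 true nulls and 10 false nulls under these assumptions, the expected value never falls below 10, the excess over 10 shrinks as $\lambda$ grows, and the variance grows with it.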

4.3.3. Effects of splitting. Because we split our data and use the two halves in our estimator, 
the quantity we are estimating actually corresponds to the FVR for a data set with half as many 
observations. As in cross validation, it is reasonable to wonder what effect this has on our estimate. 

The sample size influences the true FVR only through differences in the resulting variable order- 
ings from stepwise selection. As a result, we expect the true FVR values for both sample sizes to 
be reasonably close, and for the uncertainty in FVR estimation to dominate the difference in most 
cases. This is supported by the simulations of Section 4.4. 

In the cases where the true FVR values do diverge, the value for the half-sized data set will 
be larger, leading our estimate to be conservative. We observe this for particularly noisy data in 
Figure 14. 

4.4. Simulation. Here we present simulations of the performance of our estimation method for 
FVR in stepwise regression. We will see that the method performs well overall. We will also simulate 
under parameters specifically selected to demonstrate the potential biases discussed in Section 4.3. 

For these simulations, we use blocked settings similar to those used in Section 3.4 and illustrated 
in Figure 9. These examples are constructed to clearly illustrate the difference between FVR and 
FDR, and to show how our estimates relate to those quantities. To do so, the parameters are chosen 
so that the blocks of variables will be reasonably significant, but the variables within the blocks 
are correlated enough to be difficult to distinguish in the presence of noise. For convenience, the 
parameters of all the simulations are laid out in Table 2. All of these simulations use $\lambda = 0.5$ for 
the threshold in the estimator. All estimates are obtained by averaging 50 splits of the data set. 
All curves and standard errors are the result of 100 Monte Carlo simulations. 





             n     # blocks   # per block   # signal   $\sigma^2$   $\rho$
Figure 11    100   20         2             6          0.8          0.95
Figure 12    100   5          3             3          0.5          0.95
Figure 13    100   20         2             10         0.5          0.95
Figure 14    100   20         2             10         2            0.95

Table 2 

Parameters for the simulation settings used to make Figures 11, 12, 13 and 14. The parameters are number of 
observations, number of blocks of variables, number of variables per block, number of blocks where one variable is 
made significant, noise variance, and within block correlation. The coefficient for any significant variables is fixed at 
1 across all simulations, and all estimators use $\lambda = 0.5$. 

Ideal performance of our estimate can be seen in Figure 11. The plot shows the true FVR for 
both the full sample size and the half sample size (in black and red, respectively), along with 
the true FDR (dotted black) and our estimate (green). We show both of these true FVR values to 
demonstrate that splitting our sample has not dramatically altered our target quantity, as discussed 
in Section 4.3.3. 



[Figure 11 image: "FVR estimation with repeated splits"; legend: FVR (full), FVR (half), FVR Estimate, FDR; horizontal axis: Number of Variables]



Fig 11. Simulation of the FVR estimate for forward stepwise selection on blocked data in an ideal setting. The true 
full model FDR is shown in dotted black, while the true FVR is shown in black for selection with the full sample size, and 
red for selection with a half sample size. The estimated FVR, shown in green, closely follows both of these true FVR 
curves. 



In the simulation shown in Figure 11, stepwise selection correctly selects the six groups with 
signal, but not the particular variable within each of those groups. Thus the FDR is reasonably 
large from the beginning, as we would expect from the construction of the example, while the FVR 
remains low. We see that the FVR estimate, using the method from Section 4.2, closely matches 
the true FVR for both sample sizes. 

The simulation in Figure 12 illustrates the downward bias in our estimator due to incorrect ordering 
of the variables, as discussed in Section 4.3.1. This simulation is constructed to have a few signal 
variables, along with many very strongly correlated noise variables in relatively low noise. When 
the noise variables enter early in the path, they temporarily appear informative due to spurious 
correlations. This causes a downward bias in the estimated number of false selections. 

The simulation in Figure 13 illustrates the upward bias in the threshold estimator that was 
discussed in Section 4.3.2. Here many signal variables are included in the simulation. The FVR estimate 
(green) is biased upward from the true values (black/red), particularly at the start of the path. 
While this is worrisome, it is comforting that the dramatic bias is upward, leading to conservative 
estimates. The selection of $\lambda$ in this simulation was made without tuning to reduce this bias; 
bootstrap calibration as in Storey (2002) might help to reduce this bias. 

Finally, the simulation is conducted with a much higher noise level, shown in Figure 14. This has 
the effect of inflating the difference between true FVR values for the full and half samples. The half 
sample FVR (red) is now larger than the full sample FVR (black), implying that the additional 
information from the larger sample is important for obtaining good selections. Our estimate (green) 
is estimating the higher of these curves, and is therefore very conservative. 

We see from these simulations that the proposed estimator works reasonably well in simulation. 
The biases discussed in Section 4.3 do exist. The downward bias from relying on the stepwise 
selection ordering appears to be weak in practice and to occur mostly in the later part of the 
selection path, supporting our belief that stepwise selection is providing a reliable ordering. The 
upward bias from the threshold estimator of true null hypotheses appears when there are many 
signal variables present in the data, but it skews the results in a conservative direction. 

Fig 12. Simulation of the FVR estimate for forward stepwise selection on blocked data, demonstrating downward bias 
due to misordering of selected variables. The right side of the path shows a downward bias in the estimates (green) 
relative to the true FVR values (black and red). 

Fig 13. Simulation of FVR estimate for forward stepwise selection on blocked data, demonstrating the upward bias 
of the threshold estimator. The left side of the path shows an upward bias in the estimates (green) relative to the true 
FVR values (red and black). 

5. Conclusion. In this paper, we discussed the interpretations and implications of different 
definitions of false selection in the regression setting. We saw that these error criteria behave 
differently in cases where variables are correlated. In particular, we described difficulties for the 
standard full model definition, which lead to unintuitive or undesirable behavior in many cases. 

Fig 14. Simulation of FVR estimate for forward stepwise selection on blocked data in a high-noise setting. This 
setting shows that, in the presence of high noise, the lack of data in the half samples causes the true FVR (red) to 
be larger than the true FVR for a full sample (black). This inflates our estimator, since our estimator splits the data 
and actually estimates the half-sample FVR. 

As a solution, we introduced a new false selection error criterion, FVR, which is defined in terms 
of the projected model. This error criterion focuses on guaranteeing uniqueness of the explanatory 
variables only among the selected variables, rather than the entire data set. In doing so, it avoids 
the concerning behaviors of the traditional full model definition, leading to intuitive behavior in 
many settings. We presented several interpretations of FVR, demonstrating its differences from 
FDR and where each criterion might be appropriate to use. 

Finally, we presented a simple estimation method for FVR in stepwise regression. We showed 
that this method gave reasonable estimates of FVR over a range of simulation parameters. We also 
examined the regimes in which this estimator performed poorly, giving insight that could be helpful 
when constructing future estimators or assessing existing ones. 

The idea that each of the error criteria imposes a different idea of an "interesting" selected variable 
is a convenient view. The full model definition considers a selected variable interesting if it explains 
signal that is not captured by any other variable in the data set. In contrast, the projected model 
definition (corresponding to FVR) considers a selected variable interesting if it captures unique 
signal only among the other selected variables. Contrasting the error criteria in this way 
could help provide an intuition of which criterion is most suitable to a particular problem. 

There is plenty of interesting work to be done in understanding this new error criterion. More work 
is needed to understand better estimation or control procedures. Very preliminary work suggests 
that other common variable selection procedures like the LASSO (Tibshirani, 1996) may control 
FVR well. Existing methods that seek to control FDR in the regression setting are also likely 
to have appealing FVR properties. In another direction, one can construct an analog to the False 
Negative Rate of Genovese and Wasserman (2002) which is related to FVR and might be interesting 
to consider. Our main goal in this paper has been to identify potential concerns with the accepted 
full model FDR definition when predictors are correlated, and to propose this new FVR criterion 
which may be more appropriate to consider in those settings. 

APPENDIX A: DETAILS 



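A.1. Gaussian formulation of FVP in terms of the covariance matrices. It is particularly 
straightforward to compute the FVP in the joint multivariate Gaussian setting where the 
parameters are known. This is particularly useful when running simulations, so we include our 
approach here. 

Suppose that $(x^{(1)}, \ldots, x^{(p)}, y)^T$ is joint multivariate normal, with $X \sim N(0, \Sigma)$ and $y \sim N(X\beta, \sigma^2)$. 
Note that 

$$\mathrm{Cov}(x_i^{(j)}, y_i) = E(x_i^{(j)} y_i) = \sum_{j'=1}^{p} \beta_{j'} E(x_i^{(j)} x_i^{(j')}) = \sum_{j'=1}^{p} \beta_{j'} \Sigma_{jj'},$$

$$\mathrm{Cov}(y_i, y_i) = \mathrm{Cov}(x_i^T \beta, x_i^T \beta) + \sigma^2 = \sum_{j,j'} \beta_j \beta_{j'} E(x_i^{(j)} x_i^{(j')}) + \sigma^2 = \beta^T \Sigma \beta + \sigma^2.$$

This lets us construct the augmented covariance matrix $\tilde{\Sigma}$ for $(x^{(1)}, \ldots, x^{(p)}, y)^T$: 

$$\tilde{\Sigma} = \begin{pmatrix} \Sigma & \Sigma\beta \\ \beta^T \Sigma & \beta^T \Sigma \beta + \sigma^2 \end{pmatrix}.$$

Now suppose we select a set of variables $\mathcal{A}$. We want to assess the FVP of this set of variables. 
Let $\mathcal{A}^+$ be $\mathcal{A} \cup \{y\}$. The covariance matrix for the marginal distribution on $\mathcal{A}^+$ is just $\tilde{\Sigma}_{\mathcal{A}^+, \mathcal{A}^+}$. We 
can compute the inverse $\tilde{\Sigma}_{\mathcal{A}^+, \mathcal{A}^+}^{-1}$, and the numerator of the FVP is the number of zeros in the row 
corresponding to $y$. 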
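
The computation described in this appendix can be sketched with NumPy as follows. The function names `augmented_cov` and `fvp`, the numerical tolerance used to decide that an entry is zero, and the small two-block example are assumptions made for illustration. The sketch divides the count of zeros by the size of the selected set so that the result is a proportion.

```python
import numpy as np

def augmented_cov(Sigma, beta, sigma2):
    """Covariance of (x^(1), ..., x^(p), y) when y = x^T beta + noise with variance sigma2."""
    s_beta = Sigma @ beta
    top = np.hstack([Sigma, s_beta[:, None]])
    bottom = np.append(s_beta, beta @ s_beta + sigma2)
    return np.vstack([top, bottom])

def fvp(Sigma, beta, sigma2, selected, tol=1e-8):
    """Proportion of selected variables with a zero entry in the y-row of the inverse covariance."""
    aug = augmented_cov(Sigma, beta, sigma2)
    idx = list(selected) + [Sigma.shape[0]]   # selected variables plus y (last index)
    precision = np.linalg.inv(aug[np.ix_(idx, idx)])
    y_row = precision[-1, :-1]                # entries pairing y with each selected variable
    return int(np.sum(np.abs(y_row) < tol)) / len(selected)

# Hypothetical example: two blocks of two variables with within-block correlation 0.95,
# and signal only on the first variable of the first block.
block = np.array([[1.0, 0.95], [0.95, 1.0]])
Sigma = np.kron(np.eye(2), block)
beta = np.array([1.0, 0.0, 0.0, 0.0])

print(fvp(Sigma, beta, 0.8, [1]))       # only the correlated partner of the signal variable
print(fvp(Sigma, beta, 0.8, [0, 1]))    # signal variable and its partner
print(fvp(Sigma, beta, 0.8, [1, 2]))    # partner plus a variable from the noise block
```

In this example the partner of the signal variable counts as a useful selection when it is selected alone, but becomes redundant once the signal variable itself is also in the selected set.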

A.2. Selecting $\lambda$ for the threshold estimator. In Section 4.2, we introduce a threshold 
estimator $\hat{V}_k(\lambda) = \#\{p_j > \lambda;\ j \le k\}/(1-\lambda)$ for estimating the number of true nulls in our set of hypotheses, and 
show that the expectation of this estimator provides an upper bound on the number of true nulls in 
$\{H_j^{(\mathcal{P})};\ j \le k\}$. 

As mentioned in Section 4.3.2, there is a bias-variance trade-off involved in selecting $\lambda$. As $\lambda$ 
increases, the probability of a false null hypothesis entering will decrease, decreasing the bias in 
the estimator. However, the probability that a true null hypothesis is counted will also decrease, 
increasing the variance of the estimator. 

In Storey (2002), a bootstrap-based approach is presented for tuning $\lambda$ in a similar threshold 
estimator of pFDR. It can be shown that an equivalent condition, 

$$E\,\hat{V}_k(\lambda) \ \ge\ \min_{\lambda'} E\,\hat{V}_k(\lambda') \ \ge\ \#\{j \le k : H_j^{(\mathcal{P})}\ \text{true null}\},$$

holds for our estimator, so a similar tuning approach could be used to select $\lambda$ for a particular 
application. This would mean choosing $\lambda$ to minimize 

$$\sum_{b=1}^{B} \left( \hat{V}_k(\lambda)^{*b} - \min_{\lambda'} \hat{V}_k(\lambda') \right)^2,$$

where $b$ indexes $B$ bootstrap simulations, each of which contributes a bootstrap estimate $\hat{V}_k(\lambda)^{*b}$. 
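
A rough sketch of such a tuning step is given below. The appendix does not specify how the bootstrap samples are drawn, so resampling the p-values with replacement is an assumption here, as are the function names `v_hat` and `choose_lambda`, the grid of candidate thresholds, and the toy p-values. The criterion is divided by the number of bootstrap draws, which does not change the minimizer.

```python
import numpy as np

def v_hat(pvals, lam):
    """Threshold estimator #{p_j > lam} / (1 - lam)."""
    return np.sum(pvals > lam) / (1.0 - lam)

def choose_lambda(pvals, grid, n_boot=200, seed=0):
    """Pick the threshold minimizing the bootstrap criterion described above."""
    rng = np.random.default_rng(seed)
    pvals = np.asarray(pvals, dtype=float)
    plug_in = min(v_hat(pvals, lam) for lam in grid)   # min over lambda' of V_hat(lambda')
    mse = np.zeros(len(grid))
    for _ in range(n_boot):
        # Assumption: bootstrap by resampling the observed p-values with replacement.
        boot = rng.choice(pvals, size=pvals.size, replace=True)
        for i, lam in enumerate(grid):
            mse[i] += (v_hat(boot, lam) - plug_in) ** 2
    mse /= n_boot
    return grid[int(np.argmin(mse))], mse

# Toy p-values (hypothetical): 20 uniform nulls and 10 small non-null p-values.
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=20), rng.beta(1.0, 20.0, size=10)])
grid = [0.1, 0.3, 0.5, 0.7, 0.9]
best_lam, mse = choose_lambda(pvals, grid)
print(best_lam)
```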

A.3. Effects of variable ordering. Here we present more details in our discussion of variable 
ordering and the estimator of Section 4. 

Let $\mathcal{A}$ be the selected set of variables and $V$ be the number of variables in $\mathcal{A}$ with zero coefficients 
in the model projected onto $X_{\mathcal{A}}$, making $V$ the numerator of the FVP for $\mathcal{A}$. Let $\mathcal{B} \subset \mathcal{A}$ be a 
minimal subset of the variables in $\mathcal{A}$, such that the projection of the true model $X\beta$ onto $X_{\mathcal{B}}$ is the 
same as the projection of that model onto $X_{\mathcal{A}}$. Note that $V = |\mathcal{A}| - |\mathcal{B}|$. 

Suppose that $\mathcal{P}^*$ is an ordering of the variables in $\mathcal{A}$ such that all the variables in $\mathcal{B}$ appear 
before any other variables. Then, except for very special cases, the incremental null hypothesis is 
false at each of the first $|\mathcal{B}|$ steps along $\mathcal{P}^*$. Furthermore, the subsequent $|\mathcal{A}| - |\mathcal{B}|$ steps are true 
null hypotheses, since the projected model has been obtained once the variables in $\mathcal{B}$ have been 
included. This means that there will be exactly $V = |\mathcal{A}| - |\mathcal{B}|$ true null incremental hypotheses in 
$\{H_j^{(\mathcal{P}^*)}\}$, which is identical to the number of zero coefficients in the projected model. 

The special cases above refer to the unlikely cases where some of the variables in $\mathcal{B}$ have exactly 
zero correlation with $y$ when a strict subset of $\mathcal{B}$ are conditioned upon. However, while the 
statements above will not hold in that case for all orderings where $\mathcal{B}$ appears first, there will still exist 
a subset of such orderings for which the statements hold. Furthermore, stepwise selection will tend 
not to select the invalid orderings, so the issue should not arise in practice. 

In practice, an ideal path $\mathcal{P}^*$ is not known. Our algorithm relies on the idea that the paths 
produced by stepwise selection are close to one of these perfectly-ordered paths. We can view a 
real path $\mathcal{P}$ as a modification of a "nearest" path $\mathcal{P}^*$, where the path is modified by moving noise 
variables forward in the ordering, and the path $\mathcal{P}^*$ is nearest if it requires the fewest such moves. 
If $t$ is the number of these erroneous moves, then at worst all $t$ improperly inserted variables will 
appear significant at the time of their selection. In that worst case, the quantity being estimated 
by looking at the incremental null hypotheses is actually FVP $- t/|\mathcal{A}|$. 

This means that an improper ordering could bias the FVR estimate low, by as much as $t/|\mathcal{A}|$. 
The fact that this bias is downward is worrisome, since it could potentially lead to over-optimistic 
estimates of FVR. In simulation, we have found that stepwise regression produces orderings that 
are quite reasonable, leading to small $t$ and minimal bias. It could be interesting to investigate the 
settings in which $t$ can be shown to be controlled by stepwise selection or other methods. 
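
The effect of the ordering can be illustrated at the population level. The sketch below is not from the paper: it assumes a known Gaussian covariance with two blocks of two variables, signal on a single variable, and hypothetical helper names `projected_coefs` and `count_null_increments`. It counts the true null incremental hypotheses along an ordering that puts the signal variable first and along one in which its correlated partner is moved ahead of it.

```python
import numpy as np

def projected_coefs(Sigma, beta, subset):
    """Population coefficients of the true model X @ beta projected onto the variables in `subset`."""
    s_ss = Sigma[np.ix_(subset, subset)]
    s_sy = Sigma[subset, :] @ beta
    return np.linalg.solve(s_ss, s_sy)

def count_null_increments(Sigma, beta, order, tol=1e-8):
    """Number of steps at which the newly added variable has a zero projected coefficient."""
    nulls = 0
    for j in range(1, len(order) + 1):
        newest = projected_coefs(Sigma, beta, order[:j])[-1]
        nulls += abs(newest) < tol
    return int(nulls)

# Assumed setup: blocks (0, 1) and (2, 3) with within-block correlation 0.95; signal on variable 0.
block = np.array([[1.0, 0.95], [0.95, 1.0]])
Sigma = np.kron(np.eye(2), block)
beta = np.array([1.0, 0.0, 0.0, 0.0])

selected = [0, 1, 2, 3]
V = int(np.sum(np.abs(projected_coefs(Sigma, beta, selected)) < 1e-8))
ideal = count_null_increments(Sigma, beta, [0, 1, 2, 3])       # signal variable enters first
misordered = count_null_increments(Sigma, beta, [1, 0, 2, 3])  # its partner enters first
print(V, ideal, misordered)
```

In this setup the first ordering recovers the count of zero projected coefficients, while moving one redundant variable forward lowers the count of null steps by one, in line with the worst-case offset of $t/|\mathcal{A}|$ with $t = 1$.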